Skip to content

Conversation

H0TB0X420
Copy link

  • Strip quotes from column names in drop() method
  • Maintains consistency with other DataFrame operations
  • Both drop('col') and drop('col') now work

Rationale: CSV files with capitalized headers require quotes in select() operations but drop() failed when quotes were provided, creating inconsistent behavior across DataFrame methods.

User facing changes: users can now use either drop('col') or drop('"col"') consistently, matching the behavior of other DataFrame operations like select().

Closes #1212

This is one of my first PRs, please let me know what I can improve!

- Strip quotes from column names in drop() method
- Maintains consistency with other DataFrame operations
- Both drop('col') and drop('col') now work

Fixes apache#1212
@HeWhoHeWho
Copy link

HeWhoHeWho commented Sep 19, 2025

Hi thanks for the PR.

Rationale: CSV files with capitalized headers require quotes in select() operations but drop() failed when quotes were provided, creating inconsistent behavior across DataFrame methods.

User facing changes: users can now use either drop('col') or drop('"col"') consistently, matching the behavior of other DataFrame operations like select().

In the context of CSV capitalised col header, I just tested out select('col') without double quotes in the current version, it raised an Exception Error: FieldNotFound. I believe select() only accepts double quotes to parse capitalised col header i.e. select('"col"').

Of course, it'd be great if one can opt to use double quote or without to parse capitalised col header for DataFrame operations like select(), drop(), sort(), etc.

Let me know if I misunderstood the context.

Comment on lines 415 to 417
Returns:
DataFrame with those columns removed in the projection.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. This looks good. One request: Would you mind updating the docstring to specify that column case is respected and does not need double quotes like other operations such as select? You can also specify that leading and trailing " are allowable as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the docstring.
Based on the comment from @HeWhoHeWho should we open another issue for select(), drop(), sort() or should we expand the scope of this issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop is unique in that it only takes in column names. The others differ in that they take expressions. We have some syntactic sugar around them to allow passing in strings that get turned into expressions. From the perspective of datafusion, requiring the double quotes for select and sort is the expected behavior. It is also included in the online documentation with a call out box to emphasize it. So I think we are okay with just this PR as it is.

If people want the behavior of select and sort changed then we should consider adding an upstream change to the datafusion repo with a configuration setting to do this. I tested with the datafusion.sql_parser.enable_ident_normalization but that didn't seem to make an impact.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If people want the behavior of select and sort changed then we should consider adding an upstream change to the datafusion repo with a configuration setting to do this. I tested with the datafusion.sql_parser.enable_ident_normalization but that didn't seem to make an impact.

Say we go by - if Expression is used, then force the usage of double quote, otherwise allow select('col') or select('"col"'), same goes for sort.
Would this be a big change on upstream at datafusion repo?

- Document that column names are case-sensitive and don't require quotes
- Clarify that both quoted and unquoted column names are accepted
- Add examples showing both 'col' and 'col' syntax work
- Note difference from select() operation behavior
Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution!

@timsaucer
Copy link
Contributor

@H0TB0X420 it looks like you have some ruff formatting errors. They should be pretty easy to fix: https://github.com/apache/datafusion-python/actions/runs/17882197033/job/50875737252?pr=1242

@timsaucer
Copy link
Contributor

I'm running CI now!

It looks like there were a couple of changes that we needed and I pushed. First there were whitespace errors and some lines that were too long. We use ruff for our linter and it is configured with pre-commit so you can often catch these problems before you push to github. If you're not familiar with it, I highly suggest installing and using it. It's a very useful tool.

The second was that the documentation did not render correctly. RestructuredText (rst) can be fickle if you haven't used it before. It needs double back ticks rather than single ticks to indicate something should render as monospaced variables. Also in order to get the python example to work the line before it needs to end with a double colon ::. Like I said, it's picky when it comes to how these need to be written in order to render properly. But it's been a long established method for code documentation in Python, so often work becoming familiar with.

Thank you again for the PR! Assuming CI passes I'll merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop column syntax inconsistency
3 participants